Exploring Words with Semantic Correlations from Chinese Wikipedia

Authors

  • Yun Li
  • Kaiyan Huang
  • Seiji Tsuchiya
  • Fuji Ren
  • Yixin Zhong
Abstract

In this paper, we study semantic correlations between Chinese words based on Wikipedia documents. A corpus of about 50,000 structured documents is generated from Wikipedia pages. Then, taking hyper-links, text overlaps, and word frequencies into account, about 300,000 semantically correlated word pairs are extracted from these documents. We roughly measure the degree of semantic correlation and identify groups with tight semantic correlations through self-clustering.
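As an illustration of the kind of evidence the abstract mentions, the following is a minimal sketch of scoring a candidate word pair from hyper-link and text-overlap evidence (the toy documents and the scoring weights are illustrative assumptions, not the paper's actual procedure):

```python
# Sketch: score a candidate word pair from Wikipedia-style documents.
# Assumption: each document is a dict with a title, its text, and a set
# of outgoing link targets; the weights below are illustrative only.

def correlation_score(a, b, docs):
    """Roughly score the semantic correlation between titles a and b."""
    link_hits = overlap_hits = 0
    for doc in docs:
        # Hyper-link evidence: one article links to the other.
        if (doc["title"] == a and b in doc["links"]) or (
            doc["title"] == b and a in doc["links"]
        ):
            link_hits += 1
        # Text-overlap evidence: both words appear in the same article.
        if a in doc["text"] and b in doc["text"]:
            overlap_hits += 1
    return 2.0 * link_hits + 1.0 * overlap_hits  # illustrative weights

docs = [
    {"title": "apple", "text": "apple fruit tree", "links": {"fruit"}},
    {"title": "fruit", "text": "fruit apple banana", "links": {"apple"}},
]
print(correlation_score("apple", "fruit", docs))
```

A pair that is both mutually linked and co-occurring, as here, scores on both terms; pairs with neither kind of evidence score zero.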


Similar resources

Non-Orthogonal Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) utilizes the Wikipedia knowledge base to represent the semantics of a word as a vector in which every dimension corresponds to an explicitly defined concept, such as a Wikipedia article. ESA inherently assumes that Wikipedia concepts are orthogonal to each other; therefore, it considers two words to be related only if they co-occur in the same articles. However, two word...
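The ESA representation described above can be sketched as follows, with a toy concept index standing in for the full Wikipedia-derived weights (the concept names and weights here are hypothetical):

```python
# Sketch of ESA-style vectors: a word is represented by its weights over
# explicit Wikipedia concepts (articles). Toy index; real ESA derives
# TF-IDF-style weights from the full Wikipedia corpus.
import math

concepts = ["Fruit", "Computer", "Tree"]  # hypothetical concept space
index = {
    "apple":  [0.8, 0.6, 0.4],  # assumed association strengths
    "orange": [0.9, 0.0, 0.3],
}

def cosine(u, v):
    """Cosine similarity of two concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Under ESA's orthogonality assumption, relatedness is the cosine of the
# two concept vectors; words sharing no concepts score exactly zero.
print(round(cosine(index["apple"], index["orange"]), 3))
```

This makes the orthogonality assumption concrete: two words with disjoint concept dimensions get zero relatedness, which is exactly the limitation the non-orthogonal variant targets.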


Exploring ESA to Improve Word Relatedness

Explicit Semantic Analysis (ESA) is an approach to calculating the semantic relatedness between two words or natural-language texts with the help of concepts grounded in human cognition. ESA has received much attention in the fields of natural language processing, information retrieval, and text analysis; however, the performance of the approach depends on several parameters that are included in ...


Unsupervised Sense Clustering of Related Chinese Words

Chinese words which share the same character may carry related but different meanings, e.g., “花錢(spend)”, “花費(expend)”, “花園(garden)”, “開花(bloom)”. The semantics of these words form two clusters: {“花錢(spend)”, “花費(expend)”} and {“花園(garden)”, “開花(bloom)”}. In this paper, we aim at unsupervised clustering of a given set of such related Chinese words, where the quality of the clustering results is t...


Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information

In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort required to prepare and maintain manually segmented Chinese text for training pur...
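The mutual-information idea behind NGMI can be illustrated for the two-character case with pointwise mutual information over a tiny corpus (a toy illustration; the actual approach generalizes to n-grams and draws its statistics from Chinese Wikipedia):

```python
# Sketch: pointwise mutual information of adjacent Chinese characters,
# in the spirit of NGMI. Toy corpus and counts for illustration only.
import math
from collections import Counter

corpus = "花錢花費花園開花花園"
chars = Counter(corpus)
bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
n_chars = len(corpus)
n_bigrams = len(corpus) - 1

def mi(bigram):
    """Pointwise mutual information of a two-character string."""
    p_xy = bigrams[bigram] / n_bigrams
    p_x = chars[bigram[0]] / n_chars
    p_y = chars[bigram[1]] / n_chars
    return math.log2(p_xy / (p_x * p_y)) if p_xy else float("-inf")

# High-MI bigrams tend to be words; a segmenter can cut at low-MI gaps,
# e.g. "花花" spans the boundary between "開花" and "花園".
print(mi("花園") > mi("花花"))
```

The intuition is that characters inside a word co-occur far more often than chance predicts, while characters straddling a word boundary do not.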


Measuring of Semantic Relatedness between Words based on Wikipedia Links

A novel technique for measuring semantic relatedness between words based on the link structure of Wikipedia is presented. Only Wikipedia's link information is used in this method, which spares researchers burdensome text processing. During relatedness computation, the positive effects of bidirectional Wikipedia links and of four link types are taken into account. Using a widel...
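A link-based relatedness measure of this kind can be sketched with the widely used normalized-distance formulation over in-link sets (Milne and Witten's measure; the toy link data and article count below are assumptions, and the method described above additionally exploits bidirectional links and four link types):

```python
# Sketch of link-based relatedness over Wikipedia in-link sets, using the
# normalized-distance formulation popularized by Milne & Witten.
# Hypothetical link data; no article text is needed, only link structure.
import math

# article -> set of articles linking to it (assumed toy data)
inlinks = {
    "Cat": {"Animal", "Pet", "Mammal"},
    "Dog": {"Animal", "Pet", "Wolf"},
}
WIKI_SIZE = 1000  # assumed total number of articles

def relatedness(a, b):
    """Relatedness in [0, 1] from shared in-links of articles a and b."""
    A, B = inlinks[a], inlinks[b]
    common = A & B
    if not common:
        return 0.0
    dist = (math.log(max(len(A), len(B))) - math.log(len(common))) / (
        math.log(WIKI_SIZE) - math.log(min(len(A), len(B)))
    )
    return 1.0 - dist

print(round(relatedness("Cat", "Dog"), 3))
```

Because only link sets are consulted, the measure indeed avoids any text processing, which is the main practical appeal noted in the abstract.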



Journal title:

Volume   Issue

Pages  -

Publication date: 2008